1 Introduction

We are using the same dataset from our first project. The dataset contains 9578 observations and 14 variables and is based on loans from LendingClub, a peer-to-peer lending company, that were made from 2007 to 2015. In summary, our EDA found:

  1. LendingClub's credit underwriting criteria are demonstrably effective: borrowers who do not meet the criteria are more than twice as likely to default as borrowers who do.
  2. There are clear relationships between credit.policy and other numeric variables such as int.rate, fico, and inq.last.6mths. However, there were no clearly established relationships between not.fully.paid and the other numeric variables (except for credit.policy).
  3. There are statistically significant relationships between the categorical and logical variables purpose, credit.policy, and not.fully.paid.
  4. For all numeric variables except delinq.2yrs, the mean varies significantly across the categories of purpose.

We obtained the dataset from Kaggle here: https://www.kaggle.com/datasets/urstrulyvikas/lending-club-loan-data-analysis

Our work is stored on our team GitHub here: https://github.com/jschild01/JMB_DATS_6101

2 Preparing the Dataset for Modeling

2.1 Converting Variables

From our EDA in project 1 we know that the credit.policy and not.fully.paid variables function as logicals, and the purpose variable functions as a factor with 7 levels. We will start by formally converting these variables from numeric data types in R to logicals and a factor, respectively.

# We determined this makes sense to do from our EDA
str(loans) # confirm the structure before converting
loans$credit.policy <- as.logical(loans$credit.policy)
loans$not.fully.paid <- as.logical(loans$not.fully.paid)

loans$purpose <- as.factor(loans$purpose)

Of the 14 variables in the dataset, 11 are numeric, 2 are logical, and 1 is a factor.

2.2 Adding Additional Logicals

From our EDA we saw that the overwhelming majority of loans had a value of 0 for delinq.2yrs and pub.rec. These variables might be more useful, in terms of stronger correlations and larger impact in models, as logicals. We will introduce has.delinq.2yrs and has.pub.rec based on whether the value is greater than 0.

# It might help to turn these variables into logicals since most are 0
loans <- loans %>%
  mutate(has.delinq.2yrs = delinq.2yrs > 0,
         has.pub.rec = pub.rec > 0)

2.3 Revolving Balance Transformation

From our EDA we saw that the revol.bal variable had a wide range of values as well as outlier issues. We wondered if it would work better if we took its natural log, similar to how the income provided in the dataset (the log.annual.inc variable) is already in natural-log form. However, some values of revol.bal are 0, and taking the natural log of 0 results in -Inf. One way to deal with this without removing those observations is to add 1 to all values of revol.bal before taking the natural log.

Given the range of these values (the IQR is roughly 1.5 × 10^4), adding one should have a negligible effect, and adding one to all values keeps the data consistent. In case a revol.bal of 0 is itself meaningful, we will add a logical variable has.revol.bal based on whether the value of revol.bal is greater than 0.

# Some loans have a revol.bal value of 0, the natural log of which is -Inf
# Adding 1 to all values before taking the natural log resolves this issue by nudging the value a very small amount
# In case it was meaningful if the loan had a revol.bal value of 0 versus >0, we can add a logical to account for that
loans <- loans %>%
  mutate(log.revol.bal = log(revol.bal+1),
         has.revol.bal = revol.bal > 0)
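As an aside, base R's log1p() computes log(1 + x) with better numerical accuracy for values near zero, and is an equivalent way to express this transformation:

```r
# log1p(x) is equivalent to log(x + 1) but more accurate for tiny x
x <- c(0, 5, 28854)              # example revol.bal values, including 0
all.equal(log1p(x), log(x + 1))  # identical here; log1p is the safer idiom
```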

2.4 Utility of the new variables

Now that we have added some new variables, it would be helpful to look at a correlation plot and make some comparisons to assess whether they would be more useful going forward than their original forms.

# A new correlation matrix with the new and transformed variables
# For our correlation matrix we want to include everything but the purpose variable
# We can put the new variables together at the top
loans_correlation_matrix_new <- loans %>%
  select(-purpose) %>%
  select(revol.bal, log.revol.bal, has.revol.bal, has.delinq.2yrs, has.pub.rec, everything()) %>%
  cor()

# The mixed correlation plot makes a nice visualization
corrplot.mixed(loans_correlation_matrix_new, tl.pos = "lt")

Some notable observations:

  1. has.delinq.2yrs does not have any strong correlations, and the 2 potentially useful ones (int.rate and fico) are almost exactly the same as for delinq.2yrs.
  2. Similarly, has.pub.rec does not have any strong correlations, and the 2 potentially useful ones (int.rate and fico) are the same as for pub.rec.
  3. log.revol.bal has a stronger correlation with dti and a much stronger correlation with revol.util compared to revol.bal.
  4. log.revol.bal has a weaker correlation with credit.policy compared to revol.bal.
  5. has.revol.bal does not have a significant correlation with anything except revol.util (log.revol.bal aside), and that correlation is significantly weaker than log.revol.bal's.

From this we can conclude that converting delinq.2yrs and pub.rec to logicals did not add any advantage for our dataset. We can also conclude that the has.revol.bal variable is not necessary and will not add value to our analysis beyond what log.revol.bal already covers. Ruling these out increases our confidence in using the variables in their original forms.

Lastly, we see that taking the natural log of revol.bal might be useful, as log.revol.bal has some stronger correlations than revol.bal. We will note this as we proceed with our modeling.

2.5 Updated Visuals

These are updated histograms, boxplots, and Q-Q plots with the added log.revol.bal variable. We will use these to briefly review the EDA we did for the first project.

2.5.1 Histograms

2.5.2 Boxplots

2.5.3 Q-Q Plots

3 Classification Tree

3.1 Preparing and Building the Tree

Since one of our variables of interest, credit.policy, is a logical, we can treat it as a 2-level factor and build a classification tree model using the variables it is significantly correlated with: int.rate, fico, and inq.last.6mths.

For this variable we are interested in the loans where the value is FALSE, that is, loans that do not meet the credit underwriting criteria and are therefore more at risk of defaulting. These loans are in the minority. We will rename the FALSE values to Fails, indicating that the loan fails to meet the underwriting criteria, and the TRUE values to Meets, indicating that the loan meets the criteria.

Building the classification tree, we can see that int.rate was not used in the model, but fico and inq.last.6mths were.

# Since we are renaming some values let's make a separate dataset for this model
loans_tree <- loans

# First convert the credit.policy variable from a logical to a factor, then rename the levels
loans_tree$credit.policy <- as.factor(loans_tree$credit.policy) %>%
  fct_recode(Meets = "TRUE", Fails = "FALSE")

# Max depth of 4 based on Professor Faruqe's instructions
creditfit <- rpart(credit.policy ~ int.rate + fico + inq.last.6mths, data=loans_tree, method="class", control = list(maxdepth = 4) )
printcp(creditfit) # display the results 
## 
## Classification tree:
## rpart(formula = credit.policy ~ int.rate + fico + inq.last.6mths, 
##     data = loans_tree, method = "class", control = list(maxdepth = 4))
## 
## Variables actually used in tree construction:
## [1] fico           inq.last.6mths
## 
## Root node error: 1868/9578 = 0.19503
## 
## n= 9578 
## 
##         CP nsplit rel error  xerror     xstd
## 1 0.484475      0   1.00000 1.00000 0.020759
## 2 0.187366      1   0.51552 0.51552 0.015755
## 3 0.076017      2   0.32816 0.32816 0.012823
## 4 0.010000      3   0.25214 0.25214 0.011329
plotcp(creditfit) # visualize cross-validation results

summary(creditfit) # detailed summary of splits


3.2 Visualizing the Tree

We can better visualize the tree by plotting it. We can see the inq.last.6mths threshold of four or more, as well as the two FICO score thresholds of 740 and 660. It details that of our 9578-loan sample, 1868 loans failed to meet the credit.policy and 7710 met it.

# plot the tree and add text
rpart.plot(creditfit, uniform=TRUE, main="Classification Tree for Credit.Policy", digits = 3, extra = 1)

#plot(creditfit, uniform=TRUE, main="Classification Tree for credit.policy")
#text(creditfit, use.n=TRUE, all=TRUE, cex=.75)

Of all of the loans, 1231 had four or more inq.last.6mths. Of those 1231:

  • 1047 had a FICO lower than 740 and failed to meet the credit.policy
  • 21 had a FICO score 740 or higher and failed to meet the credit.policy
  • 163 had a FICO score 740 or higher and met the credit.policy

This indicates that applicants with four or more inq.last.6mths will likely need a FICO score of 740 or higher to meet the credit.policy.

Conversely, of all of the loans, 8347 had fewer than four inq.last.6mths. Of those 8347:

  • 352 had a FICO lower than 660 and failed to meet the credit.policy
  • 2 had a FICO lower than 660 but still managed to meet the credit.policy
  • 448 had a FICO score 660 or higher and failed to meet the credit.policy
  • 7545 had a FICO score 660 or higher and met the credit.policy

This indicates that applicants with fewer than four inq.last.6mths will likely only need a FICO score of 660 or higher to meet the credit.policy.

Overall, we can see that fewer inq.last.6mths aligns with lower FICO score requirements for the credit.policy. Generally, applicants with four or more inq.last.6mths will likely need a FICO score of 740 or higher to meet the credit.policy, while applicants with fewer than four will likely only need a FICO score of 660 or higher. Applicants with a FICO score below 660 are unlikely to meet the credit.policy regardless.
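The split logic described above can be written out as a small hand-coded function for illustration (a sketch only: the thresholds are transcribed from the plotted tree, and predict() on the fitted creditfit object remains the authoritative way to score loans):

```r
# Hand-coded mirror of the tree's splits: inquiries first, then the FICO threshold
predict_policy <- function(fico, inq.last.6mths) {
  if (inq.last.6mths >= 4) {
    if (fico >= 740) "Meets" else "Fails"  # stricter FICO bar with many inquiries
  } else {
    if (fico >= 660) "Meets" else "Fails"  # lower FICO bar with few inquiries
  }
}

predict_policy(750, 5)  # "Meets"
predict_policy(700, 5)  # "Fails"
predict_policy(661, 0)  # "Meets"
```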

# create a postscript plot of tree 
# The rpart.plot() function looks quite nice, so we may not need this
post(creditfit, file = "credittree2.ps", title = "Classification Tree for credit.policy")

3.3 Confusion Matrix

To better understand how our model performs, we can construct a confusion matrix of the results. The Fails values will function as the positive class in the confusion matrix we build from the classification tree model's predictions.

# Creating the confusion matrix
confusion_matrix = confusionMatrix(predict(creditfit, type = "class"), reference = loans_tree$credit.policy)
print('Overall: ')
## [1] "Overall: "
confusion_matrix$overall
##       Accuracy          Kappa  AccuracyLower  AccuracyUpper   AccuracyNull 
##   9.508248e-01   8.269988e-01   9.463031e-01   9.550698e-01   8.049697e-01 
## AccuracyPValue  McnemarPValue 
##   0.000000e+00  2.836114e-102
print('Class: ')
## [1] "Class: "
confusion_matrix$byClass
##          Sensitivity          Specificity       Pos Pred Value 
##            0.7489293            0.9997406            0.9985724 
##       Neg Pred Value            Precision               Recall 
##            0.9426440            0.9985724            0.7489293 
##                   F1           Prevalence       Detection Rate 
##            0.8559192            0.1950303            0.1460639 
## Detection Prevalence    Balanced Accuracy 
##            0.1462727            0.8743350
# The Kappa statistic compares the observed accuracy with the expected accuracy (random chance).
# It can be used to evaluate a single classifier or to compare classifiers against each other.
# A common interpretation scheme: < 0 indicates no agreement, 0-0.20 slight, 0.21-0.40 fair,
# 0.41-0.60 moderate, 0.61-0.80 substantial, and 0.81-1 almost perfect agreement.

Based on the Kappa statistic of 0.83, we would evaluate this classifier as having almost perfect agreement.
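As a sanity check, the Kappa value can be reproduced by hand from the confusion matrix counts reported in section 3.3 (1399 and 7708 correct; 469 and 2 incorrect):

```r
# Counts from the confusion matrix in section 3.3 (rows = actual, cols = predicted)
cm <- matrix(c(1399, 2, 469, 7708), nrow = 2,
             dimnames = list(Actual = c("Fails", "Meets"),
                             Predicted = c("Fails", "Meets")))
n  <- sum(cm)
po <- sum(diag(cm)) / n                     # observed accuracy: ~0.9508
pe <- sum(rowSums(cm) * colSums(cm)) / n^2  # accuracy expected by chance: ~0.7158
kappa <- (po - pe) / (1 - pe)
round(kappa, 3)                             # 0.827, matching the caret output
```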


The overall accuracy is 95.08%. Sensitivity (also known as recall, TP / (TP + FN)), specificity (TN / (TN + FP)), the F1 score, and the other metrics here are the same ones we used in the Logistic Regression and KNN analyses; indeed, any classifier can use the confusion matrix approach as one of its evaluation tools.
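These formulas can be checked directly against the raw counts (with Fails as the positive class: TP = 1399, FN = 469, TN = 7708, FP = 2):

```r
TP <- 1399; FN <- 469; TN <- 7708; FP <- 2  # counts with Fails as the positive class
sensitivity <- TP / (TP + FN)               # recall: ~0.7489
specificity <- TN / (TN + FP)               # ~0.9997
precision   <- TP / (TP + FP)               # ~0.9986
f1 <- 2 * precision * sensitivity / (precision + sensitivity)  # ~0.8559
accuracy <- (TP + TN) / (TP + TN + FP + FN)                    # ~0.9508
round(c(sensitivity, specificity, precision, f1, accuracy), 4)
```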

Let’s put together tables to visualize the confusion matrix as well as some statistics on it.

# Creating the confusion matrix table
# To get a table that follows the same format as our notes, with the rows representing true values and columns representing predicted values, we will have to adjust the data a bit.

# Start by converting to a data.frame
cm_table <- as.data.frame(confusion_matrix$table) %>%
  rename(Actual = "Reference")

# Make it so that the rows represent true values and columns represent predicted values
cm_table$Prediction <- paste0("Prediction - ", cm_table$Prediction)
cm_table <- spread(cm_table, Prediction, Freq)

# Output a nice table
# xkabledply(cm_table, title = "Confusion matrix for the tree model")
cm_table %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "condensed"), full_width = FALSE)
Actual Prediction - Fails Prediction - Meets
Fails 1399 469
Meets 2 7708
#Creating a table for the confusion matrix statistics
# Start by converting to a data.frame
cm_stats <- as.data.frame(confusion_matrix$byClass) %>%
  rownames_to_column()

# Adjust the column names and values to look better
colnames(cm_stats) <- c("Statistic", "Percentage")
cm_stats$Percentage <- paste0(round(cm_stats$Percentage*100, digits=1), "%")

# Output a nice table
# xkabledply(cm_stats, title = "Confusion matrix statistics for the tree model")
cm_stats %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "condensed"), full_width = FALSE)
Statistic Percentage
Sensitivity 74.9%
Specificity 100%
Pos Pred Value 99.9%
Neg Pred Value 94.3%
Precision 99.9%
Recall 74.9%
F1 85.6%
Prevalence 19.5%
Detection Rate 14.6%
Detection Prevalence 14.6%
Balanced Accuracy 87.4%

For certain statistics this model performs extremely well, with near 100% specificity, precision, and positive predictive value. Negative predictive value is also very high at around 94%.


3.4 ROC Curve

An ROC curve can also be constructed for a classification tree by using the predicted class probabilities rather than the predicted classes.

library(pROC)

# loans_tree$prediction <- predict(creditfit, type = "vector")
loans_tree$prediction <- predict(creditfit, type = "prob")[,2]
tree_roc <- roc(credit.policy ~ prediction, data=loans_tree)
auc(tree_roc) # area-under-curve prefer 0.8 or higher.
plot(tree_roc)

The AUC of the ROC curve is 0.8773732, which is greater than the 0.8 threshold, so we would consider this a good model.

4 Simple Linear Regression

4.1 Interest Rate vs FICO Score

# fit linear model
linear_model <- lm(int.rate ~ fico, data=loans)
  
# view summary of linear model
summary(linear_model)

# scatterplot with regression line and confidence interval
# scatterplot(int.rate ~ fico, data = loans)

loans %>%
  ggplot(aes(x = fico, y = int.rate)) +
  geom_point(color = "steelblue", alpha = 0.2) +
  geom_smooth(method = "lm", se = T) +
  labs(title = "Interest Rate vs FICO Score",
       x = "FICO Score", y = "Interest Rate") +
  scale_x_continuous(limits = c(600, NA), expand = expansion(mult = c(0, .05))) +
  scale_y_continuous(labels = label_percent(), limits = c(.05, NA), expand = expansion(mult = c(0, .05))) +
  theme_minimal()

4.2 Revolving Utilization Rate vs Interest Rate

# fit linear model
linear_model <- lm(revol.util ~ int.rate, data=loans)
  
# view summary of linear model
summary(linear_model)

# scatterplot with regression line and confidence interval
# scatterplot(revol.util ~ int.rate, data = loans)

loans %>%
  ggplot(aes(x = int.rate, y = revol.util)) +
  geom_point(color = "steelblue", alpha = 0.2) +
  geom_smooth(method = "lm") +
  labs(title = "Revolving Line Utilization Rate vs Interest Rate",
       x = "Interest Rate", y = "Revolving Line Utilization Rate") +
  scale_x_continuous(labels = label_percent(), limits = c(.05, NA), expand = expansion(mult = c(0, .05))) +
  scale_y_continuous(labels = label_percent(scale = 1)) +
  theme_minimal()

4.3 Installment vs Natural Log of Annual Income

# fit linear model
linear_model <- lm(installment ~ log.annual.inc, data=loans)
  
# view summary of linear model
summary(linear_model)

# scatterplot with regression line and confidence interval
# scatterplot(installment ~ log.annual.inc, data = loans)

loans %>%
  ggplot(aes(x = log.annual.inc, y = installment)) +
  geom_point(color = "steelblue", alpha = 0.2) +
  geom_smooth(method = "lm") +
  labs(title = "Installment vs Log of Annual Income",
       x = "Log of Annual Income", y = "Installment") +
  theme_minimal()

4.4 Revolving Utilization Rate vs FICO Score

# fit linear model
linear_model <- lm(revol.util ~ fico, data=loans)
  
# view summary of linear model
summary(linear_model)

# scatterplot with regression line and confidence interval
# scatterplot(revol.util ~ fico, data = loans)

loans %>%
  ggplot(aes(x = fico, y = revol.util)) +
  geom_point(color = "steelblue", alpha = 0.2) +
  geom_smooth(method = "lm") +
  labs(title = "Revolving Line Utilization Rate vs FICO Score",
       x = "FICO Score", y = "Revolving Line Utilization Rate") +
  scale_x_continuous(limits = c(600, NA), expand = expansion(mult = c(0, .05))) +
  scale_y_continuous(labels = label_percent(scale = 1)) +
  theme_minimal()

5 Multiple Linear Regression

set.seed(7)
# load the library
library(mlbench)
library(caret)
# load the dataset
# prepare training scheme
control <- trainControl(method="repeatedcv", number=10, repeats=3)
# train the model
class(loans$not.fully.paid)
model <- train(int.rate~., data=loans, method="lm", preProcess="scale", trControl=control)
# estimate variable importance
importance <- varImp(model, scale=FALSE)
# summarize importance
print(importance)
# plot importance
plot(importance)

#names(getModelInfo())
model <- lm(int.rate ~ fico+installment+purpose+revol.util+log.revol.bal+credit.policy+log.annual.inc, data = loans)
summary(model)
model$coefficients
model$call
coef(model)
##               (Intercept)                      fico               installment 
##              4.567581e-01             -4.773984e-04              4.332010e-05 
##        purposecredit_card purposedebt_consolidation        purposeeducational 
##             -2.785319e-03             -7.310921e-04              4.420619e-04 
##   purposehome_improvement     purposemajor_purchase     purposesmall_business 
##              2.647767e-03              1.561053e-03              1.634820e-02 
##                revol.util             log.revol.bal         credit.policyTRUE 
##              1.296153e-04             -1.361040e-03             -4.226653e-03 
##            log.annual.inc 
##             -1.589963e-05
confint(model)
##                                   2.5 %        97.5 %
## (Intercept)                4.473561e-01  4.661601e-01
## fico                      -4.881130e-04 -4.666839e-04
## installment                4.156564e-05  4.507456e-05
## purposecredit_card        -3.874981e-03 -1.695657e-03
## purposedebt_consolidation -1.570762e-03  1.085780e-04
## purposeeducational        -1.325965e-03  2.210089e-03
## purposehome_improvement    1.263426e-03  4.032108e-03
## purposemajor_purchase     -2.844394e-05  3.150550e-03
## purposesmall_business      1.493889e-02  1.775752e-02
## revol.util                 1.145579e-04  1.446727e-04
## log.revol.bal             -1.541020e-03 -1.181060e-03
## credit.policyTRUE         -5.071373e-03 -3.381934e-03
## log.annual.inc            -6.148037e-04  5.830045e-04
library("broom")

tidyfinal <-  tidy(model)
tidyfinal

Model_Summary <- augment(model)
str(Model_Summary)
head(Model_Summary)
par(mfrow=c(2,2))
plot(model)

vif(model)
lmtest::bptest(model)
car::ncvTest(model)
distBCMod <- caret::BoxCoxTrans(loans$int.rate)
print(distBCMod)
loans <- cbind(loans, dist_new=predict(distBCMod, loans$int.rate)) 
head(loans) 
mod1 <- lm(dist_new ~ fico+installment+purpose+revol.util+log.revol.bal+credit.policy+log.annual.inc, data=loans)
summary(mod1)
lmtest::bptest(mod1)
par(mfrow=c(2,2))
plot(mod1)

6 Logistic Regression Credit Policy

6.1 Effects on credit.policy by purpose

To study the effect of the factor purpose on credit.policy (both are categorical variables), we can create a two-way contingency table of the outcome and predictor and make sure there are no cells with zero frequencies.
*A contingency table, sometimes called a two-way frequency table, is a tabular mechanism with at least two rows and two columns used in statistics to present categorical data in terms of frequency counts.

credit.policypurposetable = xtabs(~ credit.policy + purpose, data = loans)
credit.policypurposetable

6.2 Feature Selection

set.seed(7)
# load the library
library(mlbench)
library(caret)
# load the dataset
# prepare training scheme
loans$credit.policy = as.factor(loans$credit.policy)
control <- trainControl(method="repeatedcv", number=10, repeats=3)
# train the model
class(loans$credit.policy)
model <- train(credit.policy~., data=loans, method="glm", preProcess="scale", trControl=control)
# estimate variable importance
importance <- varImp(model, scale=FALSE)
# summarize importance
print(importance)
# plot importance
plot(importance)

#names(getModelInfo())

6.3 Logistic Regression Model


loans$credit.policy <- factor(loans$credit.policy)
str(loans)
loans$purpose <- factor(loans$purpose)
#str(credit.policy)
credit.policyLogit <- glm(credit.policy ~ . , data = loans, binomial(link = "logit") )  

We can see the summary of the logit model here:

summary(credit.policyLogit)
xkabledply(credit.policyLogit, title = "Logistic Regression :")
Logistic Regression :
Estimate Std. Error z value Pr(>|z|)
(Intercept) 165.3372 28.2865 5.8451 0.0000
purposecredit_card -0.1027 0.1364 -0.7531 0.4514
purposedebt_consolidation 0.1065 0.1045 1.0197 0.3079
purposeeducational 0.1273 0.2041 0.6237 0.5328
purposehome_improvement 0.1585 0.1822 0.8700 0.3843
purposemajor_purchase 0.2299 0.2116 1.0867 0.2772
purposesmall_business 0.3742 0.1888 1.9825 0.0474
int.rate -278.4966 39.5644 -7.0391 0.0000
installment 0.0005 0.0003 1.9652 0.0494
log.annual.inc 0.6609 0.0829 7.9679 0.0000
dti -0.0108 0.0064 -1.6926 0.0905
fico 0.0495 0.0023 21.4435 0.0000
days.with.cr.line 0.0001 0.0000 5.2242 0.0000
revol.bal -0.0001 0.0000 -23.0655 0.0000
revol.util -0.0003 0.0019 -0.1518 0.8794
inq.last.6mths -1.0258 0.0279 -36.7482 0.0000
delinq.2yrs -0.4409 0.1114 -3.9588 0.0001
pub.rec -0.8341 0.3868 -2.1565 0.0310
not.fully.paidTRUE -0.2305 0.0969 -2.3784 0.0174
has.delinq.2yrsTRUE 0.7170 0.1985 3.6118 0.0003
has.pub.recTRUE 1.0311 0.4515 2.2835 0.0224
log.revol.bal 0.3949 0.0447 8.8373 0.0000
has.revol.balTRUE -1.6722 0.3780 -4.4242 0.0000
dist_new 154.9980 21.5896 7.1793 0.0000

Before moving on, let us look at the model object credit.policyLogit a little deeper. The fitted values can be found in credit.policyLogit$fitted.values, and the first fitted value is 0.9982039. This is the predicted probability that credit.policy is TRUE for data point #1. Compare this to the value from predict(credit.policyLogit), which is 6.3203648.

The predict() function gives us the logit value. You can exponentiate it to get the odds ratio p/q, 555.7757149. Finally, we can recover p from p/q, and it is indeed confirmed to be 0.9982039.

The easier way to get this is simply predict(credit.policyLogit, type = c("response"))[1] = 0.9982039. The predict() function also lets you find model predictions for unseen/untrained data points, which fitted.values cannot provide.
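The logit-to-odds-to-probability chain can be reproduced in a couple of lines, using the first fitted value quoted above; plogis() performs the whole inverse-logit conversion in one step:

```r
logit_val <- 6.3203648           # predict(credit.policyLogit)[1] from above
odds <- exp(logit_val)           # odds ratio p/q: ~555.78
p <- odds / (1 + odds)           # back to the probability: ~0.9982
all.equal(p, plogis(logit_val))  # plogis() is the built-in inverse logit
```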

p_fitted = credit.policyLogit$fitted.values[1] # this is the model predicated value p-hat for the first data row (not the actual data point p)  

This is stored in the model as the fitted value for the probability p of the first data point. Since the actual data point is a 0/1 TRUE/FALSE value, it cannot be compared directly unless we use a cutoff value (0.5 by default) to convert the probability p to 0/1.
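Applying the default 0.5 cutoff to fitted probabilities is a one-line comparison; the probabilities below are made up for illustration:

```r
p_hat <- c(0.9982, 0.41, 0.73)  # illustrative fitted probabilities
p_hat > 0.5                     # TRUE FALSE TRUE: the predicted 0/1 classes
```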

Now, for unseen data points, we can use the predict() function to find the model-predicted values. But be careful about what you are getting from predict() in classification models. Let us compare the options below; for easy comparison, we use the same data point from the dataset as an example.

# This gives you the predicted values of the data points inside the model.
predict(credit.policyLogit)  # this is from the model, which gives you the value for logit(p) or ln(p/q)

6.3.1 Confidence Intervals

We can easily determine the confidence intervals of each coefficient with these two slightly different ways:


## CIs using profiled log-likelihood
# confint(credit.policyLogit)
xkabledply( confint(credit.policyLogit), title = "CIs using profiled log-likelihood" )
CIs using profiled log-likelihood
2.5 % 97.5 %
(Intercept) 109.7610 220.6595
purposecredit_card -0.3692 0.1656
purposedebt_consolidation -0.0987 0.3110
purposeeducational -0.2680 0.5326
purposehome_improvement -0.1948 0.5196
purposemajor_purchase -0.1772 0.6530
purposesmall_business 0.0078 0.7481
int.rate -355.9522 -200.8374
installment 0.0000 0.0010
log.annual.inc 0.4990 0.8241
dti -0.0233 0.0017
fico 0.0450 0.0540
days.with.cr.line 0.0001 0.0001
revol.bal -0.0001 -0.0001
revol.util -0.0040 0.0034
inq.last.6mths -1.0814 -0.9719
delinq.2yrs -0.6640 -0.2293
pub.rec -1.5571 -0.0037
not.fully.paidTRUE -0.4196 -0.0397
has.delinq.2yrsTRUE 0.3361 1.1134
has.pub.recTRUE 0.0889 1.8886
log.revol.bal 0.3070 0.4822
has.revol.balTRUE -2.4107 -0.9286
dist_new 112.6293 197.2715
## CIs using standard errors
# confint.default(credit.policyLogit)
xkabledply( confint.default(credit.policyLogit), title = "CIs using standard errors" )
CIs using standard errors
2.5 % 97.5 %
(Intercept) 109.8968 220.7776
purposecredit_card -0.3700 0.1646
purposedebt_consolidation -0.0982 0.3113
purposeeducational -0.2727 0.5273
purposehome_improvement -0.1985 0.5155
purposemajor_purchase -0.1847 0.6446
purposesmall_business 0.0043 0.7442
int.rate -356.0414 -200.9517
installment 0.0000 0.0010
log.annual.inc 0.4983 0.8234
dti -0.0233 0.0017
fico 0.0450 0.0540
days.with.cr.line 0.0001 0.0001
revol.bal -0.0001 -0.0001
revol.util -0.0040 0.0034
inq.last.6mths -1.0806 -0.9711
delinq.2yrs -0.6591 -0.2226
pub.rec -1.5922 -0.0760
not.fully.paidTRUE -0.4204 -0.0405
has.delinq.2yrsTRUE 0.3279 1.1061
has.pub.recTRUE 0.1461 1.9161
log.revol.bal 0.3073 0.4825
has.revol.balTRUE -2.4130 -0.9314
dist_new 112.6831 197.3128

6.3.2 Model evaluation

6.3.2.1 Confusion matrix

This is just one of the many libraries that provide a confusion matrix. It is easy to use but not very powerful: it offers no way to choose the cutoff value, and it does not report metrics such as accuracy, precision, recall, sensitivity, or F1 score. Nonetheless, it is handy.

loadPkg("regclass")
# confusion_matrix(credit.policyLogit)
xkabledply( confusion_matrix(credit.policyLogit), title = "Confusion matrix from Logit Model" )
Confusion matrix from Logit Model
Predicted FALSE Predicted TRUE Total
Actual FALSE 1238 630 1868
Actual TRUE 262 7448 7710
Total 1500 8078 9578
unloadPkg("regclass")

6.3.2.2 Receiver-Operator-Characteristic (ROC) curve and Area-Under-Curve (AUC)

The Receiver-Operator-Characteristic (ROC) curve plots the true positive rate (sensitivity) against the false positive rate (1 - specificity), and the Area-Under-Curve (AUC) summarizes it. The AUC is always between 0.5 and 1; values above 0.8 are considered a good model fit.

loadPkg("pROC") # receiver operating characteristic curve, gives the diagnostic ability of a binary classifier system as its discrimination threshold is varied. The curve is on sensitivity/recall/true-positive-rate vs false_alarm/false-positive-rate/fall-out.
prob=predict(credit.policyLogit, type = "response" )
loans$prob <- prob
h <- roc(credit.policy~prob, data=loans)
auc(h) # area-under-curve prefer 0.8 or higher.
plot(h)

# unloadPkg("pROC")
head(loans)

We have here an area-under-curve of 0.9360791, which is well above 0.8, indicating good discriminating power for the model.


6.3.2.4 McFadden

McFadden's statistic is another evaluation tool we can use on logit regressions. It is one of the so-called pseudo-R-squared evaluation measures. We can calculate the value directly from its definition if we choose.

credit.policyNullLogit <- glm(credit.policy ~ 1, data = loans, family = "binomial")
mcFadden = 1 - logLik(credit.policyLogit)/logLik(credit.policyNullLogit)
mcFadden

Or we can use other libraries. The pscl (Political Science Computational Lab) library has the function pR2() (pseudo-\(R^2\)), which will do the trick.

loadPkg("pscl") # use pR2( ) function to calculate McFadden statistics for model eval
credit.policyLogitpr2 = pR2(credit.policyLogit)
credit.policyLogitpr2
unloadPkg("pscl") 

With the McFadden value of 0.5299687, which is analogous to the coefficient of determination \(R^2\), only about 53% of the variation in y is explained by the explanatory variables in the model.

A weakness of the overall model may stem from the sample size of 9578 observations; a larger number of observations could increase the sensitivity of the model.

6.3.2.5 Hosmer and Lemeshow test

The Hosmer and Lemeshow Goodness of Fit test can be used to evaluate logistic regression fit.

loadPkg("ResourceSelection") # function hoslem.test( ) for logit model evaluation
credit.policyLogitHoslem = hoslem.test(loans$credit.policy, fitted(credit.policyLogit)) # Hosmer and Lemeshow test, a chi-squared test

The result is shown here:

credit.policyLogitHoslem
## 
##  Hosmer and Lemeshow goodness of fit (GOF) test
## 
## data:  loans$credit.policy, fitted(credit.policyLogit)
## X-squared = 9578, df = 8, p-value < 2.2e-16
# Have not found a good way to display it.

The p-value is essentially 0, which is very low. For the Hosmer and Lemeshow test, a low p-value indicates the model is not a good fit, even though the coefficients are largely significant.

7 Logistic Regression Not Fully Paid

7.1 Effects on not.fully.paid by purpose

To study the effect of the factor purpose on not.fully.paid (both are categorical variables), we can create a two-way contingency table of the outcome and predictor and make sure there are no cells with zero frequencies.
*A contingency table, sometimes called a two-way frequency table, is a tabular mechanism with at least two rows and two columns used in statistics to present categorical data in terms of frequency counts.


not.fully.paidpurposetable = xtabs(~ not.fully.paid + purpose, data = loans)
not.fully.paidpurposetable

7.2 Feature Selection

set.seed(7)
# load the library
library(mlbench)
library(caret)
loans$not.fully.paid = as.factor(loans$not.fully.paid)
# load the dataset
# prepare training scheme
control <- trainControl(method="repeatedcv", number=10, repeats=3)
# train the model
class(loans$not.fully.paid)
model <- train(not.fully.paid~., data=loans, method="glm", preProcess="scale", trControl=control)
# estimate variable importance
importance <- varImp(model, scale=FALSE)
# summarize importance
print(importance)
# plot importance
plot(importance)

#names(getModelInfo())

7.3 Logistic Regression Model

loans$not.fully.paid <- factor(loans$not.fully.paid)
loans$purpose <- factor(loans$purpose)
not.fully.paidLogit <- glm(not.fully.paid ~ ., data = loans, family = binomial(link = "logit"))

We can see the summary of the logit model here:

summary(not.fully.paidLogit)
xkabledply(not.fully.paidLogit, title = "Logistic Regression :")
Logistic Regression:

                             Estimate  Std. Error  z value  Pr(>|z|)
(Intercept)                   92.9712     20.4112   4.5549    0.0000
credit.policyTRUE              0.0219      0.1014   0.2160    0.8290
purposecredit_card            -0.5212      0.1103  -4.7237    0.0000
purposedebt_consolidation     -0.3179      0.0785  -4.0508    0.0001
purposeeducational             0.0612      0.1535   0.3984    0.6903
purposehome_improvement        0.1053      0.1270   0.8287    0.4073
purposemajor_purchase         -0.2921      0.1671  -1.7482    0.0804
purposesmall_business          0.6062      0.1175   5.1589    0.0000
int.rate                    -120.2849     28.5744  -4.2095    0.0000
installment                    0.0012      0.0002   6.7618    0.0000
log.annual.inc                -0.3398      0.0611  -5.5606    0.0000
dti                           -0.0009      0.0048  -0.1873    0.8514
fico                          -0.0042      0.0015  -2.7314    0.0063
days.with.cr.line              0.0000      0.0000   1.2863    0.1983
revol.bal                      0.0000      0.0000  -0.0224    0.9822
revol.util                     0.0036      0.0014   2.5110    0.0120
inq.last.6mths                 0.0018      0.0175   0.1049    0.9164
delinq.2yrs                   -0.1538      0.0987  -1.5584    0.1191
pub.rec                       -1.6200      0.7300  -2.2190    0.0265
has.delinq.2yrsTRUE            0.1286      0.1617   0.7955    0.4263
has.pub.recTRUE                2.0944      0.7518   2.7858    0.0053
log.revol.bal                  0.0056      0.0302   0.1858    0.8526
has.revol.balTRUE             -0.1548      0.2738  -0.5654    0.5718
dist_new                      65.9301     15.5910   4.2287    0.0000
prob                          -1.3941      0.1912  -7.2898    0.0000

Before moving on, let us look at the model object not.fully.paidLogit a little deeper. The fitted values can be found in not.fully.paidLogit$fitted.values, and the first fitted value is 0.1465626. This is the probability of not.fully.paid being TRUE for data point #1. Compare this to the value from predict(not.fully.paidLogit), which is -1.7618196.

The predict() function gives us the logit value. You can exponentiate it to get the odds p/q as 0.1717321. And finally, we can find p from p/q, and indeed it is confirmed to be 0.1465626.

The easier way to get that is to simply use predict(not.fully.paidLogit, type = c("response"))[1] = 0.1465626. The predict() function also allows you to find model predictions for unseen/untrained data points, which fitted.values cannot provide.
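The logit-to-odds-to-probability conversion above can be verified with plain arithmetic. This is an illustrative sketch using the logit value reported above for data point #1:

```r
# convert the logit (log-odds) reported for data point #1 back to a probability
logit <- -1.7618196          # from predict(not.fully.paidLogit)[1]
odds  <- exp(logit)          # p/q, the odds: about 0.1717321
p     <- odds / (1 + odds)   # recover p from p/q: about 0.1465626
p
```

This recovers exactly the fitted probability 0.1465626, confirming the relationship between the link-scale and response-scale predictions.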


p_fitted = not.fully.paidLogit$fitted.values[1] # this is the model predicted value p-hat for the first data row (not the actual data point p)
p_fitted

This is stored in the model as the fitted value for the probability p of the first data point. Since the actual data point is a 0/1 True/False value, it is not easy to compare them directly unless we use a cutoff value (default of 0.5) to convert the probability p to 0/1.
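As a minimal sketch of that cutoff conversion (the probabilities here are made-up illustrative values, not model output):

```r
# convert predicted probabilities to TRUE/FALSE with a 0.5 cutoff
probs <- c(0.15, 0.47, 0.50, 0.73)   # hypothetical fitted probabilities
preds <- probs > 0.5                 # logical vector: FALSE FALSE FALSE TRUE
preds
```

The resulting logical vector can then be compared directly against the actual 0/1 outcomes, which is exactly how the confusion matrix below is built.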

Now, for unseen data points, we can use the predict() function to find the model predicted values. But be careful of what you are getting with the predict() function in classification models. Let us compare the three options below. For easy comparison, let us use the same data point in the dataset as an example.

# These give you the predicted values of the data points inside the model.
predict(not.fully.paidLogit)[1]  # this is from the model, which gives you the value for logit(p) or ln(p/q)
predict(not.fully.paidLogit, type = "response")[1]  # this gives you the probability p directly
not.fully.paidLogit$fitted.values[1]  # same as type = "response" for in-sample data points

7.3.1 Confidence Intervals

We can easily determine the confidence intervals of each coefficient in these two slightly different ways:

## CIs using profiled log-likelihood
# confint(not.fully.paidLogit)
xkabledply( confint(not.fully.paidLogit), title = "CIs using profiled log-likelihood" )
CIs using profiled log-likelihood

                               2.5 %    97.5 %
(Intercept)                  53.4238  133.4683
credit.policyTRUE            -0.1757    0.2220
purposecredit_card           -0.7397   -0.3069
purposedebt_consolidation    -0.4713   -0.1637
purposeeducational           -0.2455    0.3570
purposehome_improvement      -0.1470    0.3513
purposemajor_purchase        -0.6295    0.0267
purposesmall_business         0.3747    0.8355
int.rate                   -176.9523  -64.8951
installment                   0.0009    0.0016
log.annual.inc               -0.4599   -0.2203
dti                          -0.0102    0.0084
fico                         -0.0072   -0.0012
days.with.cr.line             0.0000    0.0000
revol.bal                     0.0000    0.0000
revol.util                    0.0008    0.0064
inq.last.6mths               -0.0329    0.0360
delinq.2yrs                  -0.3635    0.0244
pub.rec                      -3.4481   -0.4616
has.delinq.2yrsTRUE          -0.1800    0.4547
has.pub.recTRUE               0.8760    3.9498
log.revol.bal                -0.0530    0.0652
has.revol.balTRUE            -0.6933    0.3808
dist_new                     35.7106   96.8514
prob                         -1.7699   -1.0194
## CIs using standard errors
# confint.default(not.fully.paidLogit)
xkabledply( confint.default(not.fully.paidLogit), title = "CIs using standard errors" )
CIs using standard errors

                               2.5 %    97.5 %
(Intercept)                  52.9659  132.9765
credit.policyTRUE            -0.1769    0.2207
purposecredit_card           -0.7374   -0.3049
purposedebt_consolidation    -0.4716   -0.1641
purposeeducational           -0.2397    0.3620
purposehome_improvement      -0.1437    0.3542
purposemajor_purchase        -0.6195    0.0354
purposesmall_business         0.3759    0.8365
int.rate                   -176.2896  -64.2802
installment                   0.0009    0.0016
log.annual.inc               -0.4596   -0.2200
dti                          -0.0102    0.0084
fico                         -0.0072   -0.0012
days.with.cr.line             0.0000    0.0000
revol.bal                     0.0000    0.0000
revol.util                    0.0008    0.0064
inq.last.6mths               -0.0325    0.0362
delinq.2yrs                  -0.3473    0.0396
pub.rec                      -3.0508   -0.1891
has.delinq.2yrsTRUE          -0.1883    0.4455
has.pub.recTRUE               0.6209    3.5680
log.revol.bal                -0.0535    0.0647
has.revol.balTRUE            -0.6915    0.3819
dist_new                     35.3723   96.4879
prob                         -1.7689   -1.0193

7.3.2 Model evaluation

7.3.2.1 Confusion matrix

The regclass library is just one of many ways to obtain a confusion matrix. It is easy to use, but not very powerful: it does not let you choose the cutoff value, and it does not report metrics such as accuracy, precision, recall, sensitivity, or F1 score. Nonetheless, it is handy.

loadPkg("regclass")
# confusion_matrix(not.fully.paidLogit)
xkabledply( confusion_matrix(not.fully.paidLogit), title = "Confusion matrix from Logit Model" )
Confusion matrix from Logit Model

               Predicted FALSE  Predicted TRUE  Total
Actual FALSE              7999              46   8045
Actual TRUE               1486              47   1533
Total                     9485              93   9578
unloadPkg("regclass")

7.3.2.2 Receiver-Operator-Characteristic (ROC) curve and Area-Under-Curve (AUC)

The Receiver Operating Characteristic (ROC) curve plots the true positive rate (sensitivity) against the false positive rate (1 minus specificity) as the classification threshold varies. The Area Under the Curve (AUC) is always between 0.5 and 1; values above 0.8 are generally considered a good model fit.

loadPkg("pROC") # receiver operating characteristic curve, gives the diagnostic ability of a binary classifier system as its discrimination threshold is varied. The curve is on sensitivity/recall/true-positive-rate vs false_alarm/false-positive-rate/fall-out.
prob=predict(not.fully.paidLogit, type = "response" )
loans$prob <- NA
loans$prob=prob
h <- roc(not.fully.paid~prob, data=loans)
auc(h) # area-under-curve prefer 0.8 or higher.
plot(h)

# unloadPkg("pROC")
head(loans)

We have an area-under-curve of 0.6932366, which is below the 0.8 threshold. This agrees with the Hosmer and Lemeshow test that the model is not considered a good fit.


7.3.2.4 McFadden

McFadden is another evaluation tool we can use on logit regressions. It is one of several pseudo-R-squared measures for evaluating logistic models. We can calculate the value directly from its definition if we choose.

not.fully.paidNullLogit <- glm(not.fully.paid ~ 1, data = loans, family = "binomial")
mcFadden = 1 - logLik(not.fully.paidLogit)/logLik(not.fully.paidNullLogit)
mcFadden

Or we can use other libraries. The pscl (Political Science Computational Lab) library has the function pR2() (pseudo-\(R^2\)), which will do the trick.

loadPkg("pscl") # use pR2( ) function to calculate McFadden statistics for model eval
not.fully.paidLogitpr2 = pR2(not.fully.paidLogit)
not.fully.paidLogitpr2
unloadPkg("pscl") 

With a McFadden value of 0.0771848, which is analogous to the coefficient of determination \(R^2\), only about 8% of the variation in y is explained by the explanatory variables in the model.

A major weakness of the overall model may stem from the limited sample size of 9578 observations. We expect that a much larger number of observations would increase the sensitivity of the model.

7.3.2.5 Hosmer and Lemeshow test

The Hosmer and Lemeshow Goodness of Fit test can be used to evaluate logistic regression fit.

loadPkg("ResourceSelection") # function hoslem.test( ) for logit model evaluation
not.fully.paidLogitHoslem = hoslem.test(loans$not.fully.paid, fitted(not.fully.paidLogit)) # Hosmer and Lemeshow test, a chi-squared test

The result is shown here:

not.fully.paidLogitHoslem
## 
##  Hosmer and Lemeshow goodness of fit (GOF) test
## 
## data:  loans$not.fully.paid, fitted(not.fully.paidLogit)
## X-squared = 9578, df = 8, p-value < 2.2e-16

The p-value here is essentially zero (reported as < 2.2e-16), which is very low. This indicates the model is not really a good fit, even though many of the coefficients are significant.